Static llm pipeline dynamic shape model #1240

Open · wants to merge 10 commits into master from at/static-llm-pipeline-dynamic-shape-model

Conversation

AsyaPronina
Contributor

@AsyaPronina AsyaPronina commented Nov 20, 2024

@github-actions github-actions bot added the category: LLM (LLM pipeline: stateful, static) and category: samples (GenAI samples) labels Nov 20, 2024
@AsyaPronina AsyaPronina marked this pull request as draft November 20, 2024 19:26
Comment on lines 806 to 784
int64_t position_ids_data = prompt_len -1;
std::vector<int64_t> attention_mask_data(1, prompt_len);
int64_t position_ids_data = prompt_len - 1;
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);
Contributor

LOL, @TolyaTalamanov !!
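For readers outside the thread: the joke is the swapped arguments to std::vector's (count, value) constructor in the diff above.

// Before: a vector of one element whose value is prompt_len, i.e. a size/value mix-up:
std::vector<int64_t> attention_mask_data(1, prompt_len);
// After: (prompt_len - 1) elements, each set to 1, i.e. an actual attention mask:
std::vector<int64_t> attention_mask_data(prompt_len - 1, 1);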

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 6cdd518 to cc34616 Compare November 27, 2024 15:41
@github-actions github-actions bot removed the category: samples (GenAI samples) label Nov 30, 2024
@AsyaPronina AsyaPronina marked this pull request as ready for review November 30, 2024 00:45
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 306ab4a to 7c8ff06 Compare November 30, 2024 03:10
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Nov 30, 2024
### Details:
 - *item1*
 - *...*
 
### Related PRs:
- GenAI: openvinotoolkit/openvino.genai#1240
### Tickets:
 - *ticket-id*

---------

Co-authored-by: TolyaTalamanov <[email protected]>
@dmatveev dmatveev added this to the 2025.0 milestone Dec 5, 2024
@dmatveev dmatveev self-assigned this Dec 5, 2024
const ov::AnyMap& config);
};

class SMStaticLLMPipeline : public LLMPipelineImplBase {
Contributor

SMStaticLLMPipeline sounds misleading; let's not emphasize that it is a "single model" pipeline, as people are used to that meaning something different here.

The CPU/GPU pipeline is called Stateful*, if I get it right.

So, as this one is still static, let's call it StaticStatefulLLMPipeline?

Contributor Author

Thanks, sure!

Contributor Author

Fixed!

Comment on lines 11 to 13
namespace genai {

struct StaticLLMPipelineFactory {
Contributor

I think a better namespace job could be done here: statik:: could clearly be a namespace, so we'd get statik::StatelessLLMPipeline (the old class) and statik::StatefulLLMPipeline (the new one).

statik:: is picked to avoid the clash with the keyword; it could be Static:: too.
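A minimal sketch of the suggested layout, assuming the existing ov::genai namespaces and the LLMPipelineImplBase base class from this PR:

namespace ov {
namespace genai {
namespace statik {  // "static" is a C++ keyword, hence the spelling

// the old two-model pipeline
class StatelessLLMPipeline : public LLMPipelineImplBase { /* ... */ };

// the new pipeline that accepts a dynamic stateful model
class StatefulLLMPipeline : public LLMPipelineImplBase { /* ... */ };

}  // namespace statik
}  // namespace genai
}  // namespace ov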

Contributor Author

Thank you!

Contributor Author

Fixed!

Contributor Author

@AsyaPronina AsyaPronina Dec 24, 2024

@dmatveev @TolyaTalamanov, please help disambiguate: we allow the user to pass a dynamic stateful OpenVINO model into our new pipeline, which hides details such as converting the model to static and making it stateless. Should we still name the pipeline static_llm::StatefulLLMPipeline, given that it works with static and stateless models inside? Or can it really be named Stateful because the LLMCompiledModel it creates doesn't expose the additional state inputs and outputs to the user? (Though I don't know whether, by this logic, the pipeline is still static.)

Collaborator

Quoting the statik:: namespace suggestion above:

Why not just npu_impl?

Collaborator

@AsyaPronina The new pipeline is definitely stateful, as the pipeline no longer handles dynamism and states itself. Nobody cares what is inside.


update_config(properties, {"NPU_USE_NPUW", "YES"});
update_config(properties, {"NPUW_LLM", "YES"});
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_to_string(model_desc)});
Contributor

Since it is C++, can we use a C++ structure directly here? Or is the option exposed as a string only?

Contributor

A downside: when the option is a string, a change in the option structure will never break the build, but a change in a structure type will.

Contributor Author

@AsyaPronina AsyaPronina Dec 20, 2024

We can, but it is not obvious where to define this structure, since the OpenVINO NPUW code is not exposed as a public API. However, we can pass it as map<std::string, std::string> or map<std::string, ov::Any> if you think that is better. The current implementation via std::string ensures that compiled_model.get_property("NPUW_LLM_MODEL_DESC") will print something meaningful (but that is not a requirement).
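A sketch of the map-based alternative discussed here; the field names inside the map are illustrative, not an existing NPUW contract:

// Pass the model description as structured options instead of one string
// (the ModelConfigDesc fields below are assumed for illustration):
ov::AnyMap model_desc_map = {
    {"type", model_desc.type},
    {"name_or_path", model_desc.name_or_path},
    {"num_key_value_heads", model_desc.num_key_value_heads}
};
update_config(properties, {"NPUW_LLM_MODEL_DESC", model_desc_map});

get_property("NPUW_LLM_MODEL_DESC") would then return the map itself, staying typed while remaining printable.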

Contributor Author

What do you think?

Comment on lines 636 to 637
const uint32_t kMaxPromptLen = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
const uint32_t kMinResponseLen = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);
Contributor

See the code above: it was aligned to 64. It probably makes sense to unify how these options are handled between the two classes (without overdesign).
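For reference, a sketch of the round-up idiom such a shared helper could use (the helper name is hypothetical):

// Round v up to the nearest multiple of a (a must be non-zero).
constexpr uint32_t align_to(uint32_t v, uint32_t a) {
    return (v + a - 1) / a * a;
}

// Both classes could then normalize the user-provided values identically:
const uint32_t kMaxPromptLen =
    align_to(pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u), 64u);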

Contributor Author

Yes, I did the alignment inside the LLMCompiledModel constructor in NPUW. Do you think I should remove it there and do the alignment here instead? I put it in LLMCompiledModel since I thought it might be an implementation detail.

Contributor Author

What do you think?

Comment on lines +682 to +684
auto decode_start_time = std::chrono::steady_clock::now();
DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};
auto decode_stop_time = std::chrono::steady_clock::now();
Contributor

I'd highly recommend using something like https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/perf.cpp#L9, but that is clearly not for this PR (cc: @TolyaTalamanov)

Collaborator

Of course, I don't mind; it should be done for the stateful implementation as well.
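For illustration, the manual steady_clock pair in the snippet above could collapse into a small RAII timer; this is a sketch, not the npuw perf API from the link:

#include <chrono>
#include <functional>

// Reports the elapsed time of a scope through a callback on destruction.
class ScopedTimer {
public:
    using Callback = std::function<void(std::chrono::microseconds)>;
    explicit ScopedTimer(Callback on_done)
        : m_start(std::chrono::steady_clock::now()), m_on_done(std::move(on_done)) {}
    ~ScopedTimer() {
        m_on_done(std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - m_start));
    }
private:
    std::chrono::steady_clock::time_point m_start;
    Callback m_on_done;
};

// Usage around the decode call, feeding whatever perf counter is appropriate
// (record_decode_time is a placeholder, not a GenAI API):
// ScopedTimer timer([&](std::chrono::microseconds us) { record_decode_time(us); });
// DecodedResults decoded_results = {m_tokenizer.decode(encoded_results.tokens), encoded_results.scores};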

Comment on lines +776 to +777
m_request.set_tensor("input_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&input_ids_data));
m_request.set_tensor("position_ids", ov::Tensor(ov::element::i64, ov::Shape{1,1}, (void*)&position_ids_data));
Contributor

I don't see a point in having these two set all the time; the data could be updated in-place for these tensors. For the future, @TolyaTalamanov.

Contributor Author

Yes, that is true; I think we discussed something like that already. The conclusion was to check the contracts of the GenAI pipeline to understand what should be passed into the request. I don't remember the outcome; the GenAI dynamic pipeline might always pass some input, I need to check.

Collaborator

Since shape doesn't change, I believe we can create and set tensor once and then only update underlying data.

But StatefulLLMPipeline is still calling set_tensor on every iteration:
https://github.com/openvinotoolkit/openvino.genai/blame/master/src/cpp/src/lm_encoding.cpp#L183-L185
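A sketch of the set-once, update-in-place idea, assuming the {1,1} shapes from the snippet above stay fixed across iterations:

// Create and register the tensors once, e.g. at pipeline construction:
ov::Tensor input_ids(ov::element::i64, ov::Shape{1, 1});
ov::Tensor position_ids(ov::element::i64, ov::Shape{1, 1});
m_request.set_tensor("input_ids", input_ids);
m_request.set_tensor("position_ids", position_ids);

// Then, per generation step, mutate only the backing data:
input_ids.data<int64_t>()[0] = input_ids_data;
position_ids.data<int64_t>()[0] = position_ids_data;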

Comment on lines 794 to 795

// TODO: How to check that KV-Cache is full?
Contributor

Is this still open? I believe the llm_infer_request reports it via a throw now?

Contributor Author

Yes, that is true! However, I don't know whether this should be an exception or just the end of the generation stage with a warning.

Contributor Author

Fixed!

Collaborator

Well, I believe on the GenAI side we set the KV-cache size explicitly, don't we? So we know its size and how many tokens were processed. It shouldn't be a problem to check whether the cache is full.
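Under that assumption the check is a counter comparison; a sketch with hypothetical member names, where the capacity would follow from the MAX_PROMPT_LEN and MIN_RESPONSE_LEN options seen earlier:

// m_kvcache_total: fixed KV-cache capacity; m_tokens_processed: prompt plus generated tokens.
if (m_tokens_processed >= m_kvcache_total) {
    // Per the discussion above: either end generation with a warning, or throw.
    OPENVINO_THROW("KV-Cache is full, cannot generate the next token");
}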

const std::string& device,
const ov::AnyMap& config) {
auto properties = config;
const auto use_sm_pipeline = pop_or_default(properties, "USE_SM_PIPELINE", false);
Contributor

Shouldn't it be false or NO? Or maybe it shouldn't be a binary option at all?

We also need to be careful about the option name choice here, and to be honest I don't have a good name in mind.

It shouldn't be a public-looking option, that's for sure. Maybe it shouldn't be a configurable option at all but an env var, like we did for memory allocation, but that would complicate testing in the existing environments.

NPU_PIPELINE = STATEFUL (as opposed to today's STATELESS)?

Contributor Author

Thank you! Will fix this

Contributor Author

The user can set either false or NO; OpenVINO's ov::Any can parse both to bool.

Contributor Author

Fixed!

@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 74ceb28 to b00d987 Compare December 24, 2024 01:11
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from ec8c27f to 34c3fc0 Compare December 24, 2024 02:49
@AsyaPronina AsyaPronina force-pushed the at/static-llm-pipeline-dynamic-shape-model branch from 34c3fc0 to c52bd12 Compare December 24, 2024 02:56
* In the latter case the ModelDesc is stored in properties.
* This function pops the ModelDesc from the properties and returns a pair of the updated properties and the ModelDesc.
*/
std::pair<ov::AnyMap, ov::genai::ModelConfigDesc> split_model_descr(const ov::AnyMap& properties) {
std::pair<ov::AnyMap, ov::genai::static_llm::ModelConfigDesc> split_model_descr(const ov::AnyMap& properties) {
Collaborator

Why additional namespace?

@@ -412,6 +412,16 @@ ov::genai::ModelConfigDesc get_modeldesc_from_json(const std::filesystem::path&
return desc;
}

std::string model_desc_to_string(const ov::genai::static_llm::ModelConfigDesc& model_desc) {
Collaborator

We should prevent spreading this stub...

We should avoid having ModelConfigDesc at all, as it is a temporary solution to two problems:

  1. Specify the seq_len dimension for the KV-cache.
  2. Figure out whether V-tensor transpose needs to be enabled.

I'd rather have SEQ_LEN_DIM and OPTIMIZE_V_TENSORS than pass such a config as a string.
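A sketch of that proposal at the pipeline's call site; only SEQ_LEN_DIM and OPTIMIZE_V_TENSORS come from the comment, while the NPUW_LLM_* key spellings and the model_desc fields are assumptions:

// Two scoped options instead of one stringified struct:
update_config(properties, {"NPUW_LLM_SEQ_LEN_DIM", model_desc.seq_len_dim});
update_config(properties, {"NPUW_LLM_OPTIMIZE_V_TENSORS",
                           model_desc.optimize_v_tensors ? "YES" : "NO"});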

use_blobs = *anyopt;
}
// Using model_str and weights_tensor with blobs is meaningless.
OPENVINO_ASSERT(!use_blobs, "blobs cannot be used with model string and weights tensor");
Collaborator

Then it should probably be an exception for the user, right?
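For what it's worth, OPENVINO_ASSERT already throws (ov::AssertFailure derives from ov::Exception), so the user does get an exception; if a plainer user-facing error is preferred, a sketch with an illustrative message:

if (use_blobs) {
    OPENVINO_THROW("Pre-compiled blobs cannot be used together with a model string "
                   "and a weights tensor; pass a models directory instead");
}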


}

void StatefulLLMPipeline::start_chat(const std::string& system_message) {
// FIXME: Implement later
Collaborator

Then it should be an exception saying that chat is currently not supported for ...
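A minimal sketch of that behavior (the message text is illustrative):

void StatefulLLMPipeline::start_chat(const std::string& system_message) {
    (void)system_message;  // unused until chat is implemented
    OPENVINO_THROW("start_chat() is currently not supported by this pipeline");
}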

const ov::AnyMap& config) {
auto properties = config;
const auto pipeline_mode = pop_or_default(properties, "NPU_PIPELINE", std::string("STATELESS"));
OPENVINO_ASSERT(pipeline_mode == "STATELESS" || pipeline_mode == "STATEFUL",
Collaborator

I believe it should be an exception in this case.

const std::string& device,
const ov::AnyMap& config) {
auto properties = config;
const auto pipeline_mode = pop_or_default(properties, "NPU_PIPELINE", std::string("STATELESS"));
Collaborator

It's also good practice to transform it into an enum.
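A sketch of that transformation (the enum and function names are hypothetical):

enum class NPUPipelineKind { STATELESS, STATEFUL };

// Parse once at the factory boundary; downstream code switches on the enum.
NPUPipelineKind parse_pipeline_mode(const std::string& mode) {
    if (mode == "STATELESS") return NPUPipelineKind::STATELESS;
    if (mode == "STATEFUL")  return NPUPipelineKind::STATEFUL;
    OPENVINO_THROW("Unsupported NPU_PIPELINE value: ", mode);
}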


std::unique_ptr<LLMPipelineImplBase>
LLMPipelineFactory::create(const std::filesystem::path& models_path,
const std::string& device,
Collaborator

Indents broken.

const std::filesystem::path& path,
const ov::genai::Tokenizer& tokenizer,
const std::string& device,
const ov::AnyMap& config
);

StaticLLMPipeline(
StatefulLLMPipeline(
Collaborator

It's weird that we have static_llm::StatefulLLMPipeline.

Maybe it's time we name it:
npu_impl::StaticLLMPipeline - the old dual-model pipeline
npu_impl::StatefulLLMPipeline - the new one with dynamic shapes and states

and whatever else...
npu_impl::SingleModelLLMPipeline
npu_impl::StaticWhisperPipeline

CC: @dmatveev

Labels
category: LLM (LLM pipeline: stateful, static) · category: NPU
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants